Optimization (cont.)

Lecture 14

Dr. Colin Rundel

Method Variants

Method: CG in scipy

SciPy’s optimize module implements the nonlinear conjugate gradient algorithm of Polak and Ribiere, a variant that does not require the Hessian:

  • \(\alpha_k\) is calculated via a line search along the direction \(p_k\)

  • We replace the previous definition \[ \beta_{k+1} = \frac{ r^T_{k+1} \, \nabla^2 f(x_k) \, p_{k} }{p_k^T \, \nabla^2 f(x_k) \, p_k} \]

with

\[ \beta_{k+1}^{PR} = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_{k})\right)}{\nabla f(x_k)^T \, \nabla f(x_k)} \]
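As a minimal sketch (not SciPy’s implementation), the Polak-Ribiere update can be written as follows, using scipy.optimize.line_search for the Wolfe line search and adding the common “PR+” nonnegativity safeguard on \(\beta\):

```python
import numpy as np
from scipy.optimize import line_search

def nonlinear_cg_pr(x, f, grad, max_iter=100, tol=1e-8):
    # Minimal Polak-Ribiere nonlinear CG sketch (illustrative only)
    g = grad(x)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # alpha_k from a Wolfe line search along the direction p_k
        alpha = line_search(f, grad, x, p)[0]
        if alpha is None:
            break
        x = x + alpha * p
        g_new = grad(x)
        # Polak-Ribiere beta, restarted (set to 0) when negative ("PR+")
        beta = max(g_new @ (g_new - g) / (g @ g), 0.0)
        p = -g_new + beta * p
        g = g_new
    return x

# A made-up quadratic test problem with known minimum at the origin
f = lambda x: 1.5 * x[0] ** 2 + x[1] ** 2
grad = lambda x: np.array([3.0 * x[0], 2.0 * x[1]])
x_min = nonlinear_cg_pr(np.array([1.6, 1.1]), f, grad)
```

Note that, unlike the naive CG above, no Hessian appears anywhere: only function and gradient evaluations are needed.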

Method: Newton-CG & BFGS

These are both variants of Newton’s method that do not require the Hessian (though the former can use it if provided).

  • The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a quasi-Newton method that iteratively approximates the Hessian
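The core of BFGS is a rank-two update of the inverse-Hessian approximation \(H_k\) from the step \(s_k = x_{k+1} - x_k\) and gradient difference \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\); a minimal sketch of just that update step:

```python
import numpy as np

def bfgs_update(H, s, y):
    # BFGS rank-two update of the inverse Hessian approximation:
    # H_{k+1} = (I - rho s y^T) H_k (I - rho y s^T) + rho s s^T,
    # with rho = 1 / (y^T s)
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

# For a quadratic f(x) = 0.5 x^T A x, the gradient difference is y = A s,
# and the updated H satisfies the secant condition H y = s exactly
A = np.array([[2.0, 0.0], [0.0, 10.0]])
H = np.eye(2)
s = np.array([1.0, 0.5])
y = A @ s
H = bfgs_update(H, s, y)
```

The update is constructed so that the secant condition \(H_{k+1} y_k = s_k\) holds, which is what lets BFGS accumulate curvature information without ever evaluating the Hessian.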

Method: Nelder-Mead

This is a gradient-free method that uses a sequence of simplexes to iteratively bracket the minimum.
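The basic move sorts the simplex vertices by objective value and reflects the worst vertex through the centroid of the rest; a minimal sketch of that single reflection step (illustrative only, not SciPy’s implementation, which also expands, contracts, and shrinks the simplex):

```python
import numpy as np

def reflect_worst(simplex, f, alpha=1.0):
    # Sort vertices from best to worst by objective value, then reflect
    # the worst vertex through the centroid of the remaining vertices
    simplex = simplex[np.argsort([f(v) for v in simplex])]
    centroid = simplex[:-1].mean(axis=0)
    reflected = centroid + alpha * (centroid - simplex[-1])
    if f(reflected) < f(simplex[-1]):
        simplex[-1] = reflected  # accept the reflection if it improves
    return simplex

# One step on f(x) = ||x||^2 with a made-up starting simplex
f = lambda v: v @ v
simplex = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5]])
simplex = reflect_worst(simplex, f)
```

No gradients are used anywhere, which is why Nelder-Mead works on non-smooth objectives but needs many more function evaluations than the gradient-based methods (as the timings below show).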

Method Summary


SciPy Method    Description                                                     Gradient   Hessian
—               Gradient Descent (naive w/ backtracking)                        Required   —
—               Newton’s method (naive w/ backtracking)                         Required   Required
—               Conjugate Gradient (naive)                                      Required   Required
"CG"            Nonlinear Conjugate Gradient (Polak and Ribiere variation)      Optional   —
"Newton-CG"     Truncated Newton method (Newton w/ CG step direction)           Required   Optional
"BFGS"          Broyden, Fletcher, Goldfarb, and Shanno (quasi-Newton method)   Optional   —
"L-BFGS-B"      Limited-memory BFGS (quasi-Newton method)                       Optional   —
"Nelder-Mead"   Nelder-Mead simplex reflection method                           —          —

Methods collection

def define_methods(x, f, grad, hess, tol=1e-8):
  return {
    "naive_newton":   lambda: newtons_method(x, f, grad, hess, tol=tol),
    "naive_cg":       lambda: conjugate_gradient(x, f, grad, hess, tol=tol),
    "CG":             lambda: optimize.minimize(f, x, jac=grad, method="CG", tol=tol),
    "newton-cg":      lambda: optimize.minimize(f, x, jac=grad, hess=None, method="Newton-CG", tol=tol),
    "newton-cg w/ H": lambda: optimize.minimize(f, x, jac=grad, hess=hess, method="Newton-CG", tol=tol),
    "bfgs":           lambda: optimize.minimize(f, x, jac=grad, method="BFGS", tol=tol),
    "bfgs w/o G":     lambda: optimize.minimize(f, x, method="BFGS", tol=tol),
    "l-bfgs-b":       lambda: optimize.minimize(f, x, method="L-BFGS-B", tol=tol),
    "nelder-mead":    lambda: optimize.minimize(f, x, method="Nelder-Mead", tol=tol)
  }

Method Timings

x = (1.6, 1.1)
f, grad, hess = mk_quad(0.7)
methods = define_methods(x, f, grad, hess)

df = pd.DataFrame({
  key: timeit.Timer(methods[key]).repeat(10, 100) for key in methods
})
  
df
   naive_newton  naive_cg        CG  ...  bfgs w/o G  l-bfgs-b  nelder-mead
0      0.029418  0.031432  0.010260  ...    0.048270  0.029618     0.100661
1      0.028514  0.032393  0.009922  ...    0.048328  0.026831     0.099941
2      0.028562  0.032846  0.009725  ...    0.048308  0.026727     0.100675
3      0.029202  0.032913  0.009430  ...    0.048968  0.026292     0.099893
4      0.029238  0.031702  0.009614  ...    0.048761  0.026301     0.100530
5      0.029393  0.032050  0.009654  ...    0.048400  0.026221     0.100207
6      0.029281  0.032417  0.009676  ...    0.048008  0.026354     0.100014
7      0.029246  0.031986  0.009658  ...    0.048347  0.026398     0.098695
8      0.029418  0.032260  0.009675  ...    0.051405  0.026557     0.102137
9      0.029312  0.032251  0.009771  ...    0.048857  0.026232     0.098924

[10 rows x 9 columns]

Timings across cost functions

def time_cost_func(n, x, name, cost_func, *args):
  f, grad, hess = cost_func(*args)
  methods = define_methods(x, f, grad, hess)
  
  return ( pd.DataFrame({
      key: timeit.Timer(
        methods[key]
      ).repeat(n, n) 
      for key in methods
    })
    .melt()
    .assign(cost_func = name)
  )

df = pd.concat([
  time_cost_func(10, x, "Well-cond quad", mk_quad, 0.7),
  time_cost_func(10, x, "Ill-cond quad", mk_quad, 0.02),
  time_cost_func(10, x, "Rosenbrock", mk_rosenbrock)
])

df
        variable     value       cost_func
0   naive_newton  0.003061  Well-cond quad
1   naive_newton  0.003014  Well-cond quad
2   naive_newton  0.002951  Well-cond quad
3   naive_newton  0.002920  Well-cond quad
4   naive_newton  0.002978  Well-cond quad
..           ...       ...             ...
85   nelder-mead  0.016167      Rosenbrock
86   nelder-mead  0.016182      Rosenbrock
87   nelder-mead  0.016139      Rosenbrock
88   nelder-mead  0.016248      Rosenbrock
89   nelder-mead  0.016163      Rosenbrock

[270 rows x 3 columns]

Random starting locations

x0s = np.random.default_rng(seed=1234).uniform(-2,2, (20,2))

df = pd.concat([
  pd.concat([
    time_cost_func(3, x0, "Well-cond quad", mk_quad, 0.7),
    time_cost_func(3, x0, "Ill-cond quad", mk_quad, 0.02),
    time_cost_func(3, x0, "Rosenbrock", mk_rosenbrock)
  ])
  for x0 in x0s
])

Profiling - BFGS (cProfile)

import cProfile

f, grad, hess = mk_quad(0.7)
def run():
  optimize.minimize(fun = f, x0 = (1.6, 1.1), jac=grad, method="BFGS", tol=1e-11)

cProfile.run('run()', sort="tottime")
         1011 function calls in 0.001 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 _optimize.py:1328(_minimize_bfgs)
       58    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       10    0.000    0.000    0.000    0.000 <string>:3(f)
       26    0.000    0.000    0.000    0.000 _optimize.py:194(vecnorm)
        9    0.000    0.000    0.000    0.000 _dcsrch.py:201(__call__)
        1    0.000    0.000    0.001    0.001 _minimize.py:53(minimize)
       36    0.000    0.000    0.000    0.000 fromnumeric.py:69(_wrapreduction)
       20    0.000    0.000    0.000    0.000 numeric.py:2475(array_equal)
       10    0.000    0.000    0.000    0.000 <string>:9(gradient)
        9    0.000    0.000    0.000    0.000 _linesearch.py:100(scalar_search_wolfe1)
       18    0.000    0.000    0.000    0.000 _dcsrch.py:269(_iterate)
        9    0.000    0.000    0.000    0.000 _linesearch.py:86(derphi)
       26    0.000    0.000    0.000    0.000 fromnumeric.py:2338(sum)
        1    0.000    0.000    0.000    0.000 _differentiable_functions.py:166(__init__)
        9    0.000    0.000    0.000    0.000 _linesearch.py:37(line_search_wolfe1)
       31    0.000    0.000    0.000    0.000 {built-in method numpy.array}
       10    0.000    0.000    0.000    0.000 _differentiable_functions.py:16(wrapped)
       10    0.000    0.000    0.000    0.000 _differentiable_functions.py:39(wrapped)
        9    0.000    0.000    0.000    0.000 _optimize.py:1139(_line_search_wolfe12)
        1    0.000    0.000    0.001    0.001 <string>:1(run)
        9    0.000    0.000    0.000    0.000 _linesearch.py:82(phi)
       10    0.000    0.000    0.000    0.000 _differentiable_functions.py:323(fun)
       64    0.000    0.000    0.000    0.000 {built-in method numpy.asarray}
       20    0.000    0.000    0.000    0.000 {built-in method numpy.arange}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
       10    0.000    0.000    0.000    0.000 _aliases.py:78(asarray)
        9    0.000    0.000    0.000    0.000 _differentiable_functions.py:270(_update_x)
       20    0.000    0.000    0.000    0.000 {method 'all' of 'numpy.ndarray' objects}
       52    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 _optimize.py:203(_prepare_scalar_function)
       11    0.000    0.000    0.000    0.000 shape_base.py:21(atleast_1d)
       10    0.000    0.000    0.000    0.000 _differentiable_functions.py:329(grad)
        1    0.000    0.000    0.000    0.000 _twodim_base_impl.py:163(eye)
       10    0.000    0.000    0.000    0.000 fromnumeric.py:3168(amax)
        1    0.000    0.000    0.000    0.000 _linalg.py:2575(norm)
       11    0.000    0.000    0.000    0.000 _differentiable_functions.py:293(_update_fun)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:89(_wrapreduction_any_all)
       20    0.000    0.000    0.000    0.000 {method 'copy' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 _differentiable_functions.py:55(_wrapper_hess)
       37    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
       20    0.000    0.000    0.000    0.000 _methods.py:67(_all)
       11    0.000    0.000    0.000    0.000 _differentiable_functions.py:303(_update_grad)
       10    0.000    0.000    0.000    0.000 {method 'astype' of 'numpy.ndarray' objects}
       21    0.000    0.000    0.000    0.000 _function_base_impl.py:898(copy)
        1    0.000    0.000    0.000    0.000 numerictypes.py:381(isdtype)
       51    0.000    0.000    0.000    0.000 multiarray.py:761(dot)
       10    0.000    0.000    0.000    0.000 _aliases.py:236(astype)
       32    0.000    0.000    0.000    0.000 {built-in method numpy.asanyarray}
       10    0.000    0.000    0.000    0.000 enum.py:202(__get__)
        9    0.000    0.000    0.000    0.000 {built-in method builtins.min}
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2477(any)
        1    0.000    0.000    0.000    0.000 shape_base.py:80(atleast_2d)
       10    0.000    0.000    0.000    0.000 numeric.py:1927(isscalar)
        1    0.000    0.000    0.000    0.000 {method 'dot' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 numerictypes.py:368(_preprocess_dtype)
        9    0.000    0.000    0.000    0.000 _linesearch.py:27(_check_c1_c2)
       26    0.000    0.000    0.000    0.000 fromnumeric.py:2333(_sum_dispatcher)
        9    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        9    0.000    0.000    0.000    0.000 _dcsrch.py:174(__init__)
       14    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 _minimize.py:1066(standardize_constraints)
       10    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
       21    0.000    0.000    0.000    0.000 _function_base_impl.py:894(_copy_dispatcher)
        1    0.000    0.000    0.000    0.000 {method 'flatten' of 'numpy.ndarray' objects}
       10    0.000    0.000    0.000    0.000 enum.py:1292(value)
       10    0.000    0.000    0.000    0.000 _funcs.py:12(atleast_nd)
       20    0.000    0.000    0.000    0.000 numeric.py:2456(_array_equal_dispatcher)
        9    0.000    0.000    0.000    0.000 {built-in method builtins.abs}
        1    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 _array_api.py:114(array_namespace)
        1    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        9    0.000    0.000    0.000    0.000 _util.py:1061(_call_callback_maybe_halt)
        8    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
       11    0.000    0.000    0.000    0.000 shape_base.py:17(_atleast_1d_dispatcher)
       10    0.000    0.000    0.000    0.000 fromnumeric.py:3047(_max_dispatcher)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'update' of 'set' objects}
        1    0.000    0.000    0.000    0.000 _base.py:1401(issparse)
        1    0.000    0.000    0.000    0.000 _differentiable_functions.py:35(_wrapper_grad)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
        1    0.000    0.000    0.000    0.000 _optimize.py:160(_check_positive_definite)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 _differentiable_functions.py:13(_wrapper_fun)
        1    0.000    0.000    0.000    0.000 _methods.py:58(_any)
        1    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        2    0.000    0.000    0.000    0.000 {built-in method _operator.index}
        1    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 _linalg.py:128(isComplexType)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2472(_any_dispatcher)
        1    0.000    0.000    0.000    0.000 shape_base.py:76(_atleast_2d_dispatcher)
        1    0.000    0.000    0.000    0.000 _optimize.py:283(hess)
        1    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 _optimize.py:88(_wrap_callback)
        1    0.000    0.000    0.000    0.000 _differentiable_functions.py:258(nfev)
        1    0.000    0.000    0.000    0.000 _differentiable_functions.py:262(ngev)
        1    0.000    0.000    0.000    0.000 _optimize.py:175(_check_unknown_options)
        1    0.000    0.000    0.000    0.000 _linalg.py:2571(_norm_dispatcher)

Profiling - BFGS (pyinstrument)

from pyinstrument import Profiler

f, grad, hess = mk_quad(0.7)

profiler = Profiler(interval=0.00001)

profiler.start()
opt = optimize.minimize(fun = f, x0 = (1.6, 1.1), jac=grad, method="BFGS", tol=1e-11)
p = profiler.stop()

profiler.print(show_all=True)

Profiling - Nelder-Mead

from pyinstrument import Profiler

f, grad, hess = mk_quad(0.7)

profiler = Profiler(interval=0.00001)

profiler.start()
opt = optimize.minimize(fun = f, x0 = (1.6, 1.1), method="Nelder-Mead", tol=1e-11)
p = profiler.stop()

profiler.print(show_all=True)

optimize.minimize() output

f, grad, hess = mk_quad(0.7)
optimize.minimize(fun = f, x0 = (1.6, 1.1), 
                  jac=grad, method="BFGS")
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 1.2739256453439323e-11
        x: [-5.318e-07 -8.843e-06]
      nit: 6
      jac: [-3.510e-07 -2.860e-06]
 hess_inv: [[ 1.515e+00 -3.438e-03]
            [-3.438e-03  3.035e+00]]
     nfev: 7
     njev: 7
optimize.minimize(fun = f, x0 = (1.6, 1.1), 
                  jac=grad, hess=hess, 
                  method="Newton-CG")
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 2.3418652989289317e-12
       x: [ 0.000e+00  3.806e-06]
     nit: 11
     jac: [ 0.000e+00  4.102e-06]
    nfev: 12
    njev: 12
    nhev: 11

optimize.minimize() output (cont.)

optimize.minimize(fun = f, x0 = (1.6, 1.1), 
                  jac=grad, method="CG")
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 1.4450021261144105e-32
       x: [-1.943e-16 -1.110e-16]
     nit: 2
     jac: [-1.282e-16 -3.590e-17]
    nfev: 5
    njev: 5
optimize.minimize(fun = f, x0 = (1.6, 1.1), 
                  jac=grad, method="Nelder-Mead")
       message: Optimization terminated successfully.
       success: True
        status: 0
           fun: 2.3077013477040082e-10
             x: [ 1.088e-05  3.443e-05]
           nit: 46
          nfev: 89
 final_simplex: (array([[ 1.088e-05,  3.443e-05],
                       [ 1.882e-05, -3.825e-05],
                       [-3.966e-05, -3.147e-05]]), array([ 2.308e-10,  3.534e-10,  6.791e-10]))

Collect

def run_collect(name, x0, cost_func, *args, tol=1e-8, skip=[]):
  f, grad, hess = cost_func(*args)
  methods = define_methods(x0, f, grad, hess, tol)
  
  res = []
  for method in methods:
    if method in skip: continue
    
    x = methods[method]()
    
    d = {
      "name":    name,
      "method":  method,
      "nit":     x["nit"],
      "nfev":    x["nfev"],
      "njev":    x.get("njev"),
      "nhev":    x.get("nhev"),
      "success": x["success"]
      #"message": x["message"]
    }
    res.append( pd.DataFrame(d, index=[1]) )
  
  return pd.concat(res)
df = pd.concat([
  run_collect(name, (1.6, 1.1), cost_func, arg, skip=['naive_newton', 'naive_cg']) 
  for name, cost_func, arg in zip(
    ("Well-cond quad", "Ill-cond quad", "Rosenbrock"), 
    (mk_quad, mk_quad, mk_rosenbrock), 
    (0.7, 0.02, None)
  )
])

df
             name          method  nit  nfev  njev  nhev  success
1  Well-cond quad              CG    2     5     5  None     True
1  Well-cond quad       newton-cg    5     6    13     0     True
1  Well-cond quad  newton-cg w/ H   15    15    15    15     True
1  Well-cond quad            bfgs    8     9     9  None     True
1  Well-cond quad      bfgs w/o G    8    27     9  None     True
1  Well-cond quad        l-bfgs-b    6    21     7  None     True
1  Well-cond quad     nelder-mead   76   147  None  None     True
1   Ill-cond quad              CG    9    17    17  None     True
1   Ill-cond quad       newton-cg    3     4     9     0     True
1   Ill-cond quad  newton-cg w/ H   72   136   136    72     True
1   Ill-cond quad            bfgs    5    11    11  None     True
1   Ill-cond quad      bfgs w/o G    5    33    11  None     True
1   Ill-cond quad        l-bfgs-b    5    30    10  None     True
1   Ill-cond quad     nelder-mead  102   198  None  None     True
1      Rosenbrock              CG   17    52    48  None     True
1      Rosenbrock       newton-cg   18    22    60     0     True
1      Rosenbrock  newton-cg w/ H   17    21    21    17     True
1      Rosenbrock            bfgs   23    26    26  None     True
1      Rosenbrock      bfgs w/o G   23    78    26  None     True
1      Rosenbrock        l-bfgs-b   19    75    25  None     True
1      Rosenbrock     nelder-mead   96   183  None  None     True

Exercise 1

Try minimizing the following function using different optimization methods, starting from \(x_0 = [0,0]\). Which method(s) appear to work best?

\[ \begin{align} f(x) = \exp(x_1-1) + \exp(-x_2+1) + (x_1-x_2)^2 \end{align} \]

MVN Example

MVN density cost function

For an \(n\)-dimensional multivariate normal we define
the \(n \times 1\) vectors \(x\) and \(\mu\) and the \(n \times n\)
covariance matrix \(\Sigma\),

\[ \begin{align} f(x) =& \det(2\pi\Sigma)^{-1/2} \\ & \exp \left[-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] \\ \\ \nabla f(x) =& -f(x) \Sigma^{-1}(x-\mu) \\ \\ \nabla^2 f(x) =& f(x) \left( \Sigma^{-1}(x-\mu)(x-\mu)^T\Sigma^{-1} - \Sigma^{-1}\right) \\ \end{align} \]

Our goal will be to find the mode (maximum) of this density.

def mk_mvn(mu, Sigma):
  Sigma_inv = np.linalg.inv(Sigma)
  norm_const = 1 / (np.sqrt(np.linalg.det(2*np.pi*Sigma)))
  
  # Returns the negative density (since we want the max not min)
  def f(x):
    x_m = x - mu
    return -(norm_const * 
      np.exp( -0.5 * (x_m.T @ Sigma_inv @ x_m).item() ))
  
  def grad(x):
    return (-f(x) * Sigma_inv @ (x - mu))
  
  def hess(x):
    n = len(x)
    x_m = x - mu
    return f(x) * ((Sigma_inv @ x_m).reshape((n,1)) @ (x_m.T @ Sigma_inv).reshape((1,n)) - Sigma_inv)
  
  return f, grad, hess

Gradient checking

One of the most common issues when implementing an optimizer is getting the gradient calculation wrong, which can produce problematic results. It is possible to numerically check a gradient function by comparing its results against finite differences of the objective function via optimize.check_grad().
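Under the hood this amounts to a finite-difference comparison; a minimal sketch of the idea (not SciPy’s exact implementation, which uses approx_fprime and reports the 2-norm of the difference):

```python
import numpy as np

def fd_grad(f, x, eps=1.49e-8):
    # Forward-difference approximation of the gradient of f at x,
    # perturbing one coordinate at a time
    g = np.zeros_like(x, dtype=float)
    fx = f(x)
    for i in range(len(x)):
        x_step = x.copy()
        x_step[i] += eps
        g[i] = (f(x_step) - fx) / eps
    return g

# Check a known gradient: f(x) = sum(x_i^2) has gradient 2x
f = lambda x: x @ x
x = np.array([1.0, -2.0, 0.5])
err = np.linalg.norm(fd_grad(f, x) - 2 * x)
```

A small `err` (here on the order of the step size) indicates the analytic gradient agrees with the objective; a large value, as in the wrong-gradient example below, flags a bug.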

# 2d
f, grad, hess = mk_mvn(np.zeros(2), np.eye(2,2))
optimize.check_grad(f, grad, np.zeros(2))
np.float64(2.634178031930877e-09)
optimize.check_grad(f, grad, np.ones(2))
np.float64(5.213238144735062e-10)
# 5d
f, grad, hess = mk_mvn(np.zeros(5), np.eye(5,5))
optimize.check_grad(f, grad, np.zeros(5))
np.float64(2.6031257322754127e-10)
optimize.check_grad(f, grad, np.ones(5))
np.float64(1.725679820308689e-11)
# 10d
f, grad, hess = mk_mvn(np.zeros(10), np.eye(10))
optimize.check_grad(f, grad, np.zeros(10))
np.float64(2.8760747774580336e-12)
optimize.check_grad(f, grad, np.ones(10))
np.float64(2.850398669793798e-14)
# 20d
f, grad, hess = mk_mvn(np.zeros(20), np.eye(20))
optimize.check_grad(f, grad, np.zeros(20))
np.float64(4.965068306494546e-16)
optimize.check_grad(f, grad, np.ones(20))
np.float64(1.0342002372572841e-20)

Gradient checking (wrong gradient)

wrong_grad = lambda x: 2*grad(x)
# 2d
f, grad, hess = mk_mvn(np.zeros(2), np.eye(2,2))
optimize.check_grad(f, wrong_grad, [0,0])
np.float64(2.634178031930877e-09)
optimize.check_grad(f, wrong_grad, [1,1])
np.float64(0.08280196633767578)
# 5d
f, grad, hess = mk_mvn(np.zeros(5), np.eye(5))
optimize.check_grad(f, wrong_grad, np.zeros(5))
np.float64(2.6031257322754127e-10)
optimize.check_grad(f, wrong_grad, np.ones(5))
np.float64(0.0018548087267515347)

Hessian checking

Note that since the Jacobian of the gradient is the Hessian, we can use this same function to check our Hessian implementation as well; just pass grad() as func and hess() as grad.

# 2d
f, grad, hess = mk_mvn(np.zeros(2), np.eye(2,2))
optimize.check_grad(grad, hess, [0,0])
np.float64(3.925231146709438e-17)
optimize.check_grad(grad, hess, [1,1])
np.float64(8.399162985270666e-10)
# 5d
f, grad, hess = mk_mvn(np.zeros(5), np.eye(5))
optimize.check_grad(grad, hess, np.zeros(5))
np.float64(3.878959614448864e-18)
optimize.check_grad(grad, hess, np.ones(5))
np.float64(3.8156075963144067e-11)

Unit MVNs

df = pd.concat([
  run_collect(
    name, np.ones(n), mk_mvn, 
    np.zeros(n), np.eye(n), 
    tol=1e-10, 
    skip=['naive_newton', 'naive_cg', 'bfgs w/o G', 'newton-cg w/ H']
  ) 
  for name, n in zip(
    ("5d", "10d", "20d", "30d"), 
    (5, 10, 20, 30)
  )
])
df
  name       method   nit  nfev  njev  nhev  success
1   5d           CG     7    52    52  None     True
1   5d    newton-cg     3     7    10     0     True
1   5d         bfgs     3    12    12  None     True
1   5d     l-bfgs-b     4    54     9  None     True
1   5d  nelder-mead   290   523  None  None     True
1  10d           CG     2    24    24  None     True
1  10d    newton-cg     2     8    10     0     True
1  10d         bfgs     3    19    19  None     True
1  10d     l-bfgs-b     3   110    10  None     True
1  10d  nelder-mead  1403  2000  None  None    False
1  20d           CG     0     1     1  None     True
1  20d    newton-cg     1     1     1     0     True
1  20d         bfgs     0     1     1  None     True
1  20d     l-bfgs-b     0    21     1  None     True
1  20d  nelder-mead  3217  4000  None  None    False
1  30d           CG     0     1     1  None     True
1  30d    newton-cg     1     1     1     0     True
1  30d         bfgs     0     1     1  None     True
1  30d     l-bfgs-b     0    31     1  None     True
1  30d  nelder-mead  5097  6000  None  None    False

Performance (Unit MVNs)

Adding correlation

def build_Sigma(n, corr=0.5):
  S = np.full((n,n), corr)
  np.fill_diagonal(S, 1)
  return S

df = pd.concat([
  run_collect(
    name, np.ones(n), mk_mvn, 
    np.zeros(n), build_Sigma(n), 
    tol=1e-10, 
    skip=['naive_newton', 'naive_cg', 'bfgs w/o G', 'newton-cg w/ H']
  ) 
  for name, n in zip(
    ("5d", "10d", "20d", "30d"), 
    (5, 10, 20, 30)
  )
])
df
  name       method   nit  nfev  njev  nhev  success
1   5d           CG    17    80    80  None     True
1   5d    newton-cg     4     6    10     0     True
1   5d         bfgs     4    10    10  None     True
1   5d     l-bfgs-b     4    54     9  None     True
1   5d  nelder-mead   224   427  None  None     True
1  10d           CG     4    31    31  None     True
1  10d    newton-cg     4     5     9     0     True
1  10d         bfgs     2    12    12  None     True
1  10d     l-bfgs-b     3    77     7  None     True
1  10d  nelder-mead  1184  1802  None  None     True
1  20d           CG     2    27    27  None     True
1  20d    newton-cg     2     5     7     0     True
1  20d         bfgs     2    17    17  None     True
1  20d     l-bfgs-b     2   105     5  None     True
1  20d  nelder-mead  2745  4000  None  None    False
1  30d           CG     1    18    18  None     True
1  30d    newton-cg     1     1     1     0     True
1  30d         bfgs     1    18    18  None     True
1  30d     l-bfgs-b     1    93     3  None     True
1  30d  nelder-mead  4687  6000  None  None    False

What’s going on? (good)

n = 5
f, grad, hess = mk_mvn(np.zeros(n), build_Sigma(n))
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-9)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -0.023337250777292103
       x: [ 1.082e-07  1.059e-07  1.076e-07  1.088e-07  1.077e-07]
     nit: 14
     jac: [ 8.637e-10  7.556e-10  8.374e-10  8.917e-10  8.381e-10]
    nfev: 67
    njev: 67
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-10)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -0.023337250777292328
       x: [ 9.586e-10  2.948e-10  7.975e-10  1.131e-09  8.016e-10]
     nit: 17
     jac: [ 1.376e-11 -1.723e-11  6.238e-12  2.179e-11  6.430e-12]
    nfev: 80
    njev: 80
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-11)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -0.023337250777292328
       x: [ 6.697e-10  6.565e-10  6.665e-10  6.731e-10  6.666e-10]
     nit: 18
     jac: [ 5.335e-12  4.720e-12  5.186e-12  5.494e-12  5.189e-12]
    nfev: 83
    njev: 83

What’s going on? (okay)

n = 20
f, grad, hess = mk_mvn(np.zeros(n), build_Sigma(n))
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-9)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -2.330191334018497e-06
       x: [-3.221e-04 -3.221e-04 ... -3.221e-04 -3.221e-04]
     nit: 2
     jac: [-7.148e-11 -7.148e-11 ... -7.148e-11 -7.148e-11]
    nfev: 27
    njev: 27
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-10)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -2.330191334018497e-06
       x: [-3.221e-04 -3.221e-04 ... -3.221e-04 -3.221e-04]
     nit: 2
     jac: [-7.148e-11 -7.148e-11 ... -7.148e-11 -7.148e-11]
    nfev: 27
    njev: 27
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-11)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -2.3301915597315495e-06
       x: [-4.506e-05 -4.506e-05 ... -4.506e-05 -4.506e-05]
     nit: 2884
     jac: [-9.999e-12 -9.999e-12 ... -9.999e-12 -9.999e-12]
    nfev: 66313
    njev: 66313

What’s going on? (bad)

n = 30
f, grad, hess = mk_mvn(np.zeros(n), build_Sigma(n))
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-9)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -2.381146302597316e-09
       x: [ 1.000e+00  1.000e+00 ...  1.000e+00  1.000e+00]
     nit: 0
     jac: [ 1.536e-10  1.536e-10 ...  1.536e-10  1.536e-10]
    nfev: 1
    njev: 1
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-10)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -6.180056227752818e-09
       x: [ 1.203e-01  1.203e-01 ...  1.203e-01  1.203e-01]
     nit: 1
     jac: [ 4.795e-11  4.795e-11 ...  4.795e-11  4.795e-11]
    nfev: 18
    njev: 18
optimize.minimize(f, np.ones(n), jac=grad, 
                  method="CG", tol=1e-11)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -6.26701117075865e-09
       x: [-5.021e-03 -5.021e-03 ... -5.021e-03 -5.021e-03]
     nit: 2
     jac: [-2.030e-12 -2.030e-12 ... -2.030e-12 -2.030e-12]
    nfev: 35
    njev: 35
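One plausible reading of the pattern above is scaling rather than conditioning: as \(n\) grows, the density (and hence its gradient) at the starting point becomes vanishingly small, so the gradient-norm stopping rule (gtol) is satisfied almost immediately. A quick check of the density value at \(x = (1, \ldots, 1)\) (build_Sigma is redefined here so the snippet is self-contained):

```python
import numpy as np

def build_Sigma(n, corr=0.5):
    # Same equicorrelation Sigma as used above
    S = np.full((n, n), corr)
    np.fill_diagonal(S, 1)
    return S

# MVN density value at the starting point as the dimension grows
dens_by_n = {}
for n in (5, 20, 30):
    Sigma = build_Sigma(n)
    x = np.ones(n)
    quad = x @ np.linalg.solve(Sigma, x)
    dens_by_n[n] = np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))
print(dens_by_n)
```

At \(n = 30\) the density at the start is already around \(2 \times 10^{-9}\), matching the `fun` value reported with `nit: 0` above: the gradient there is below the tolerance before a single step is taken.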

Options (bfgs)

optimize.show_options(solver="minimize", method="bfgs")
Minimization of scalar function of one or more variables using the
BFGS algorithm.

Options
-------
disp : bool
    Set to True to print convergence messages.
maxiter : int
    Maximum number of iterations to perform.
gtol : float
    Terminate successfully if gradient norm is less than `gtol`.
norm : float
    Order of norm (Inf is max, -Inf is min).
eps : float or ndarray
    If `jac is None` the absolute step size used for numerical
    approximation of the jacobian via forward differences.
return_all : bool, optional
    Set to True to return a list of the best solution at each of the
    iterations.
finite_diff_rel_step : None or array_like, optional
    If ``jac in ['2-point', '3-point', 'cs']`` the relative step size to
    use for numerical approximation of the jacobian. The absolute step
    size is computed as ``h = rel_step * sign(x) * max(1, abs(x))``,
    possibly adjusted to fit into the bounds. For ``jac='3-point'``
    the sign of `h` is ignored. If None (default) then step is selected
    automatically.
xrtol : float, default: 0
    Relative tolerance for `x`. Terminate successfully if step size is
    less than ``xk * xrtol`` where ``xk`` is the current parameter vector.
c1 : float, default: 1e-4
    Parameter for Armijo condition rule.
c2 : float, default: 0.9
    Parameter for curvature condition rule.
hess_inv0 : None or ndarray, optional
    Initial inverse hessian estimate, shape (n, n). If None (default) then
    the identity matrix is used.

Notes
-----
Parameters `c1` and `c2` must satisfy ``0 < c1 < c2 < 1``.

If minimization doesn't complete successfully, with an error message of
``Desired error not necessarily achieved due to precision loss``, then
consider setting `gtol` to a higher value. This precision loss typically
occurs when the (finite difference) numerical differentiation cannot provide
sufficient precision to satisfy the `gtol` termination criterion.
This can happen when working in single precision and a callable jac is not
provided. For single precision problems a `gtol` of 1e-3 seems to work.

Options (Nelder-Mead)

optimize.show_options(solver="minimize", method="Nelder-Mead")
Minimization of scalar function of one or more variables using the
Nelder-Mead algorithm.

Options
-------
disp : bool
    Set to True to print convergence messages.
maxiter, maxfev : int
    Maximum allowed number of iterations and function evaluations.
    Will default to ``N*200``, where ``N`` is the number of
    variables, if neither `maxiter` or `maxfev` is set. If both
    `maxiter` and `maxfev` are set, minimization will stop at the
    first reached.
return_all : bool, optional
    Set to True to return a list of the best solution at each of the
    iterations.
initial_simplex : array_like of shape (N + 1, N)
    Initial simplex. If given, overrides `x0`.
    ``initial_simplex[j,:]`` should contain the coordinates of
    the jth vertex of the ``N+1`` vertices in the simplex, where
    ``N`` is the dimension.
xatol : float, optional
    Absolute error in xopt between iterations that is acceptable for
    convergence.
fatol : number, optional
    Absolute error in func(xopt) between iterations that is acceptable for
    convergence.
adaptive : bool, optional
    Adapt algorithm parameters to dimensionality of problem. Useful for
    high-dimensional minimization [1]_.
bounds : sequence or `Bounds`, optional
    Bounds on variables. There are two ways to specify the bounds:

    1. Instance of `Bounds` class.
    2. Sequence of ``(min, max)`` pairs for each element in `x`. None
       is used to specify no bound.

    Note that this just clips all vertices in simplex based on
    the bounds.

References
----------
.. [1] Gao, F. and Han, L.
   Implementing the Nelder-Mead simplex algorithm with adaptive
   parameters. 2012. Computational Optimization and Applications.
   51:1, pp. 259-277

SciPy implementation

The following code comes from SciPy’s minimize() implementation:

if tol is not None:
  options = dict(options)
  if meth == 'nelder-mead':
      options.setdefault('xatol', tol)
      options.setdefault('fatol', tol)
  if meth in ('newton-cg', 'powell', 'tnc'):
      options.setdefault('xtol', tol)
  if meth in ('powell', 'l-bfgs-b', 'tnc', 'slsqp'):
      options.setdefault('ftol', tol)
  if meth in ('bfgs', 'cg', 'l-bfgs-b', 'tnc', 'dogleg',
              'trust-ncg', 'trust-exact', 'trust-krylov'):
      options.setdefault('gtol', tol)
  if meth in ('cobyla', '_custom'):
      options.setdefault('tol', tol)
  if meth == 'trust-constr':
      options.setdefault('xtol', tol)
      options.setdefault('gtol', tol)
      options.setdefault('barrier_tol', tol)
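
To illustrate the mapping above (a sketch of my own, not from the slides): for BFGS, passing tol is equivalent to setting gtol explicitly in options, so these two calls should produce identical results.

```python
import numpy as np
from scipy import optimize

# A simple convex quadratic with known minimum at (1, -2)
f = lambda x: (x[0] - 1)**2 + (x[1] + 2)**2
g = lambda x: np.array([2*(x[0] - 1), 2*(x[1] + 2)])

r1 = optimize.minimize(f, [0, 0], jac=g, method="bfgs", tol=1e-8)
r2 = optimize.minimize(f, [0, 0], jac=g, method="bfgs", options={"gtol": 1e-8})

assert np.allclose(r1.x, r2.x)
```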

Some general advice

  • Having access to the gradient is almost always helpful / necessary

  • Having access to the Hessian can be helpful, but usually does not significantly improve things

  • The curse of dimensionality is real

  • Be careful with tol - it means different things for different methods

  • In general, BFGS or L-BFGS should be a first choice for most problems (either well- or ill-conditioned)

    • CG can perform better for well-conditioned problems with cheap function evaluations
Maximum Likelihood example

Normal MLE

import numpy as np
from scipy.stats import norm

n = norm(-3.2, 1.25)
x = n.rvs(size=100, random_state=1234)
x.round(2)
array([-2.61, -4.69, -1.41, -3.59, -4.1 , -2.09, -2.13, -4.  , -3.18,
       -6.  , -1.76, -1.96, -2.01, -5.73, -3.62, -3.2 , -2.69, -2.84,
       -1.55, -5.13, -3.45, -4.02, -2.96, -2.51, -1.55, -3.79, -2.36,
       -5.47, -3.43, -1.88, -3.7 , -2.78, -1.89, -1.89, -2.12, -3.35,
       -3.04, -3.6 , -2.15, -0.21, -3.1 , -3.91, -3.15, -5.79, -2.89,
       -4.32, -3.37, -3.18, -2.26, -2.93, -2.15, -5.01, -4.95, -3.33,
       -3.89, -3.38, -2.76, -3.24, -2.49, -1.27, -4.42, -3.29, -2.82,
       -3.46, -1.91, -6.2 , -0.66, -4.63, -2.94, -2.32, -4.18, -2.62,
       -2.32, -2.55, -4.36, -0.69, -2.92, -4.64, -2.41, -3.15, -2.62,
       -7.65, -1.55, -3.01, -2.99, -3.74, -2.24, -1.97, -2.86, -1.46,
       -3.1 , -3.7 , -4.48, -3.93, -2.18, -3.3 , -3.63, -2.54, -4.54,
       -3.84])
{'µ': x.mean(), 'σ': x.std()}
{'µ': np.float64(-3.156109646093205), 'σ': np.float64(1.2446060629192537)}
mle_norm = lambda θ: -np.sum(
  norm.logpdf(x, loc=θ[0], scale=θ[1])
)

mle_norm([0,1])
np.float64(667.3974708213642)
mle_norm([-3, 1])
np.float64(170.56457699340282)
mle_norm([-3.2, 1.25])
np.float64(163.83926813257395)
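
Since the gradient is almost always helpful, we can also derive an analytic jac for this negative log-likelihood (my own sketch, not shown in the slides), and sanity-check it against a finite-difference approximation:

```python
import numpy as np
from scipy import optimize
from scipy.stats import norm

x = norm(-3.2, 1.25).rvs(size=100, random_state=1234)
mle_norm = lambda θ: -np.sum(norm.logpdf(x, loc=θ[0], scale=θ[1]))

def mle_norm_grad(θ):
    # Gradient of the normal negative log-likelihood:
    #   ∂/∂μ = -Σ(x - μ)/σ²,   ∂/∂σ = n/σ - Σ(x - μ)²/σ³
    μ, σ = θ
    return np.array([
        -np.sum(x - μ) / σ**2,
        x.size / σ - np.sum((x - μ)**2) / σ**3,
    ])

# Check against a numerical gradient
num = optimize.approx_fprime(np.array([-3.0, 1.2]), mle_norm, 1e-7)
assert np.allclose(num, mle_norm_grad([-3.0, 1.2]), rtol=1e-4, atol=1e-4)
```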

Minimizing

optimize.minimize(mle_norm, x0=[0,1], method="bfgs")
  message: Desired error not necessarily achieved due to precision loss.
  success: False
   status: 2
      fun: nan
        x: [-1.436e+04 -3.533e+03]
      nit: 2
      jac: [       nan        nan]
 hess_inv: [[ 9.443e-01  2.340e-01]
            [ 2.340e-01  5.905e-02]]
     nfev: 339
     njev: 113

Adding constraints

def mle_norm2(θ): 
  if θ[1] <= 0:
    return np.inf
  else:
    return -np.sum(norm.logpdf(x, loc=θ[0], scale=θ[1]))
optimize.minimize(mle_norm2, x0=[0,1], method="bfgs")
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 163.77575977255518
        x: [-3.156e+00  1.245e+00]
      nit: 9
      jac: [ 0.000e+00  0.000e+00]
 hess_inv: [[ 1.475e-02 -1.179e-04]
            [-1.179e-04  7.723e-03]]
     nfev: 43
     njev: 14

Specifying Bounds

It is also possible to specify bounds via the bounds argument, but this is only supported by certain optimization methods.

optimize.minimize(
  mle_norm, x0=[0,1], method="l-bfgs-b",
  bounds = [(-1e16, 1e16), (1e-16, 1e16)]
)
  message: CONVERGENCE: RELATIVE REDUCTION OF F <= FACTR*EPSMCH
  success: True
   status: 0
      fun: 163.77575977287245
        x: [-3.156e+00  1.245e+00]
      nit: 10
      jac: [ 2.046e-04  0.000e+00]
     nfev: 69
     njev: 23
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
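
One use of the returned hess_inv (a sketch, assuming the todense() method of SciPy's LbfgsInvHessProduct): for an MLE, the inverse Hessian of the negative log-likelihood gives approximate standard errors for the parameters, though the L-BFGS approximation can be crude and should not be relied on for careful inference.

```python
import numpy as np
from scipy import optimize
from scipy.stats import norm

x = norm(-3.2, 1.25).rvs(size=100, random_state=1234)
mle_norm = lambda θ: -np.sum(norm.logpdf(x, loc=θ[0], scale=θ[1]))

res = optimize.minimize(
    mle_norm, x0=[0, 1], method="l-bfgs-b",
    bounds=[(-1e16, 1e16), (1e-16, 1e16)]
)

# Materialize the implicit inverse-Hessian approximation as a dense array
H_inv = res.hess_inv.todense()
se = np.sqrt(np.diag(H_inv))   # rough standard errors for (μ, σ)
print(res.x, se)
```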

Exercise 2

Using optimize.minimize() recover the shape and scale parameters for these data using MLE.

from scipy.stats import gamma

g = gamma(a=2.0, scale=2.0)
x = g.rvs(size=100, random_state=1234)
x.round(2)
array([ 4.7 ,  1.11,  1.8 ,  6.19,  3.37,  0.25,  6.45,  0.36,  4.49,
        4.14,  2.84,  1.91,  8.03,  2.26,  2.88,  6.88,  6.84,  6.83,
        6.1 ,  3.03,  3.67,  2.57,  3.53,  2.07,  4.01,  1.51,  5.69,
        3.92,  6.01,  0.82,  2.11,  2.97,  5.02,  9.13,  4.19,  2.82,
       11.81,  1.17,  1.69,  4.67,  1.47, 11.67,  5.25,  3.44,  8.04,
        3.74,  5.73,  6.58,  3.54,  2.4 ,  1.32,  2.04,  2.52,  4.89,
        4.14,  5.02,  4.75,  8.24,  7.6 ,  1.  ,  6.14,  0.58,  2.83,
        2.88,  5.42,  0.5 ,  3.46,  4.46,  1.86,  4.59,  2.24,  2.62,
        3.99,  3.74,  5.27,  1.42,  0.56,  7.54,  5.5 ,  1.58,  5.49,
        6.57,  4.79,  5.84,  8.21,  1.66,  1.53,  4.27,  2.57,  1.48,
        5.23,  3.84,  3.15,  2.1 ,  3.71,  2.79,  0.86,  8.52,  4.36,
        3.3 ])
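
One possible approach (a sketch of a solution, mirroring the normal MLE above):

```python
import numpy as np
from scipy import optimize
from scipy.stats import gamma

x = gamma(a=2.0, scale=2.0).rvs(size=100, random_state=1234)

def mle_gamma(θ):
    # Negative log-likelihood; guard against invalid (non-positive) parameters
    if θ[0] <= 0 or θ[1] <= 0:
        return np.inf
    return -np.sum(gamma.logpdf(x, a=θ[0], scale=θ[1]))

res = optimize.minimize(mle_gamma, x0=[1, 1], method="nelder-mead")
print(res.x.round(3))
```

The estimates should land near the true shape and scale of 2.0 (up to sampling variability).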